The Excel document influenza.xlsx contains weekly data on the mortality and the number of laboratory-confirmed cases of influenza in Sweden. In addition, there is information about population-weighted temperature anomalies (temperature deficits).
| Year | Week | Time | Mortality | Influenza | Temperature deficit |
|---|---|---|---|---|---|
| 1995 | 1 | 1995.000 | 1940 | 0 | 1.34518 |
| 1995 | 2 | 1995.019 | 1898 | 0 | 0.00000 |
| 1995 | 3 | 1995.038 | 1812 | 0 | 0.00000 |
| 1995 | 4 | 1995.058 | 1869 | 6 | 1.40262 |
| 1995 | 5 | 1995.077 | 1857 | 2 | 0.00000 |
| 1995 | 6 | 1995.096 | 1802 | 4 | 0.00000 |
Task: Use time series plots to visually inspect how the mortality and the number of influenza cases vary with time (use Time as the X axis). Based on this plot, comment on how the number of influenza cases is related to mortality rates.
Answer: Mortality and influenza are first plotted separately; to make any relationship easier to judge, a third plot superimposes the two series.
When the two series are superimposed, the mortality peaks coincide with the influenza outbreaks, which suggests that mortality is influenced by influenza.
Task: Use gam() function from mgcv package to fit a GAM model in which Mortality is normally distributed and modelled as a linear function of Year and spline function of Week, and make sure that the model parameters are selected by the generalized cross-validation. Report the underlying probabilistic model.
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Mortality ~ Year + s(Week)
##
## Estimated degrees of freedom:
## 8.59 total = 10.59
##
## GCV score: 9014.609
The underlying probabilistic model is:
\[ \text{Mortality} \sim \mathcal{N}\left(\alpha + \beta_{Year} X_{Year} + s(X_{Week}),\; \sigma^2\right) \]
where \(s(\cdot)\) denotes the penalized spline function of Week.
Task: Plot predicted and observed mortality against time for the fitted model and comment on the quality of the fit. Investigate the output of the GAM model and report which terms appear to be significant in the model. Is there a trend in mortality change from one year to another? Plot the spline component and interpret the plot.
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Mortality ~ Year + s(Week)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -652.058 3448.379 -0.189 0.85
## Year 1.219 1.725 0.706 0.48
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Week) 8.587 8.951 100.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.661 Deviance explained = 66.8%
## GCV = 9014.6 Scale est. = 8806.7 n = 459
Comment: The fit is reasonably good: the model captures the yearly cyclical pattern correctly. A drawback of the smoothing is that the extreme values are not predicted accurately, as the peaks are smoothed out.
The spline component follows the seasonal trend observed across the years, with a peak at the beginning/end of each year.
Task: Examine how the penalty factor of the spline function in the GAM model from step 2 influences the estimated deviance of the model. Make plots of the predicted and observed mortality against time for cases of very high and very low penalty factors. What is the relation of the penalty factor to the degrees of freedom? Do your results confirm this relationship?
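The relation between the penalty and the degrees of freedom can be checked directly: in mgcv, a larger smoothing parameter `sp` shrinks the spline toward a straight line, lowering the effective degrees of freedom. A minimal self-contained sketch on simulated weekly data (not the influenza data):

```r
library(mgcv)

set.seed(1)
# Simulated seasonal mortality: one smooth yearly cycle plus noise
d <- data.frame(Week = rep(1:52, times = 5))
d$Mortality <- 1800 + 150 * cos(2 * pi * d$Week / 52) + rnorm(nrow(d), sd = 50)

# Total effective degrees of freedom for increasing penalties
edf <- sapply(c(0.001, 1, 1000), function(sp) {
  m <- gam(Mortality ~ s(Week, sp = sp), data = d)
  sum(m$edf)
})
edf  # decreasing: a higher penalty leaves fewer effective degrees of freedom
```

This confirms the expected inverse relationship between the penalty factor and the degrees of freedom.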
Task: Use the model obtained in step 2 and plot the residuals and the influenza values against time (in one plot). Is the temporal pattern in the residuals correlated to the outbreaks of influenza?
Comment: Yes, the residuals and the outbreaks appear to be correlated: when there is an outbreak, the residuals also increase at that particular moment.
Task: Fit a GAM model in R in which mortality is modelled as an additive function of the spline functions of year, week, and the number of confirmed cases of influenza. Use the output of this GAM function to conclude whether or not the mortality is influenced by the outbreaks of influenza. Provide the plot of the original and fitted Mortality against Time and comment whether the model seems to be better than the previous GAM models.
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Mortality ~ Year + s(Week, sp = penalty)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1173.5135 4211.1335 0.279 0.781
## Year 0.3053 2.1067 0.145 0.885
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Week) 1.639 2.036 171.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.494 Deviance explained = 49.7%
## GCV = 13263 Scale est. = 13157 n = 459
Comment: Judging from the plot of the original and fitted mortality against time, this model follows the outbreak peaks that the earlier models smoothed out, so it appears to fit better, and the influenza term does seem to influence mortality.
The data file data.csv contains information about 64 e-mails which were manually collected from the DBWorld mailing list. Each e-mail was classified as either an 'announcement of a conference' (1) or 'everything else' (0) (variable Conference).
| X000euro | X5102011 | X10th | X11th | X12noon | X12th | X13th | X14th | X15th | X16th |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
## Call:
## pamr.cv(fit = nsc_model, data = mydata)
## threshold nonzero errors
## 1 0.0 3428 6
## 2 0.1 3409 6
## 3 0.2 3114 7
## 4 0.3 3024 7
## 5 0.4 3000 7
## 6 0.5 1979 6
## 7 0.6 852 6
## 8 0.7 841 6
## 9 0.8 673 6
## 10 0.9 622 6
## 11 1.0 297 6
## 12 1.1 293 6
## 13 1.2 272 6
## 14 1.3 231 5
## 15 1.4 170 5
## 16 1.5 138 6
## 17 1.6 129 7
## 18 1.7 98 7
## 19 1.8 88 7
## 20 1.9 71 7
## 21 2.0 62 7
## 22 2.1 47 7
## 23 2.2 43 7
## 24 2.3 36 7
## 25 2.4 30 7
## 26 2.5 20 7
## 27 2.6 20 7
## 28 2.7 14 7
## 29 2.8 12 7
## 30 2.9 12 9
## 31 3.0 12 9
## 32 3.1 11 9
## 33 3.2 9 10
## 34 3.3 9 11
## 35 3.4 6 11
## 36 3.5 6 13
## 37 3.6 6 14
## 38 3.7 6 16
## 39 3.8 4 16
## 40 3.9 2 20
## 41 4.0 1 20
Compute the test error and the number of contributing features for the following methods fitted to the training data: elastic net with the binomial response and α = 0.5, and a support vector machine with the "vanilladot" (linear) kernel.
Compare the results of these models with the results of the nearest shrunken centroids (make a comparative table). Which model would you prefer and why?
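The two quantities needed for each row of the comparative table can be computed with small helpers; a self-contained sketch (the function names are illustrative, not from the lab code):

```r
# Test (misclassification) error: share of wrong predictions on held-out data
misclass_rate <- function(truth, pred) mean(truth != pred)

# Number of contributing features: non-zero coefficients (e.g. taken from
# coef() of the elastic net fit, or the selected gene list for NSC)
n_contributing <- function(coefs) sum(coefs != 0)

misclass_rate(c(0, 1, 1, 0, 1), c(0, 1, 0, 0, 1))  # 0.2
n_contributing(c(0, 1.3, 0, -0.7, 0))              # 2
```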
## Setting default kernel parameters
## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.
##
## Call: glmnet(x = x, y = y, family = "binomial", alpha = 0.5)
##
## Df %Dev Lambda
## [1,] 0 -1.624e-16 0.668200
## [2,] 1 1.301e-02 0.637800
## [3,] 3 3.387e-02 0.608800
## [4,] 6 5.984e-02 0.581100
## [5,] 6 8.816e-02 0.554700
## [6,] 6 1.149e-01 0.529500
## [7,] 7 1.403e-01 0.505400
## [8,] 7 1.657e-01 0.482500
## [9,] 9 1.900e-01 0.460500
## [10,] 9 2.141e-01 0.439600
## [11,] 10 2.375e-01 0.419600
## [12,] 10 2.596e-01 0.400600
## [13,] 10 2.805e-01 0.382300
## [14,] 11 3.012e-01 0.365000
## [15,] 12 3.216e-01 0.348400
## [16,] 12 3.409e-01 0.332500
## [17,] 12 3.592e-01 0.317400
## [18,] 13 3.775e-01 0.303000
## [19,] 13 3.952e-01 0.289200
## [20,] 15 4.123e-01 0.276100
## [21,] 17 4.323e-01 0.263500
## [22,] 20 4.524e-01 0.251600
## [23,] 24 4.737e-01 0.240100
## [24,] 25 4.951e-01 0.229200
## [25,] 26 5.156e-01 0.218800
## [26,] 27 5.354e-01 0.208800
## [27,] 27 5.544e-01 0.199400
## [28,] 28 5.724e-01 0.190300
## [29,] 29 5.895e-01 0.181600
## [30,] 32 6.062e-01 0.173400
## [31,] 32 6.221e-01 0.165500
## [32,] 33 6.373e-01 0.158000
## [33,] 34 6.518e-01 0.150800
## [34,] 35 6.657e-01 0.144000
## [35,] 36 6.790e-01 0.137400
## [36,] 38 6.916e-01 0.131200
## [37,] 39 7.037e-01 0.125200
## [38,] 40 7.152e-01 0.119500
## [39,] 39 7.261e-01 0.114100
## [40,] 39 7.365e-01 0.108900
## [41,] 40 7.464e-01 0.103900
## [42,] 41 7.559e-01 0.099220
## [43,] 42 7.649e-01 0.094710
## [44,] 43 7.734e-01 0.090410
## [45,] 43 7.816e-01 0.086300
## [46,] 45 7.894e-01 0.082370
## [47,] 45 7.968e-01 0.078630
## [48,] 50 8.040e-01 0.075060
## [49,] 51 8.108e-01 0.071640
## [50,] 51 8.173e-01 0.068390
## [51,] 52 8.235e-01 0.065280
## [52,] 53 8.294e-01 0.062310
## [53,] 54 8.351e-01 0.059480
## [54,] 54 8.405e-01 0.056780
## [55,] 55 8.456e-01 0.054200
## [56,] 57 8.505e-01 0.051730
## [57,] 57 8.552e-01 0.049380
## [58,] 57 8.596e-01 0.047140
## [59,] 60 8.639e-01 0.045000
## [60,] 68 8.680e-01 0.042950
## [61,] 72 8.719e-01 0.041000
## [62,] 74 8.756e-01 0.039130
## [63,] 77 8.791e-01 0.037360
## [64,] 80 8.825e-01 0.035660
## [65,] 80 8.858e-01 0.034040
## [66,] 80 8.889e-01 0.032490
## [67,] 80 8.918e-01 0.031010
## [68,] 80 8.946e-01 0.029600
## [69,] 83 8.973e-01 0.028260
## [70,] 84 8.998e-01 0.026970
## [71,] 85 9.023e-01 0.025750
## [72,] 85 9.046e-01 0.024580
## [73,] 86 9.068e-01 0.023460
## [74,] 86 9.089e-01 0.022390
## [75,] 86 9.109e-01 0.021380
## [76,] 86 9.128e-01 0.020400
## [77,] 86 9.147e-01 0.019480
## [78,] 88 9.164e-01 0.018590
## [79,] 88 9.181e-01 0.017750
## [80,] 88 9.197e-01 0.016940
## [81,] 108 9.212e-01 0.016170
## [82,] 123 9.227e-01 0.015440
## [83,] 128 9.241e-01 0.014730
## [84,] 129 9.254e-01 0.014060
## [85,] 129 9.267e-01 0.013420
## [86,] 143 9.279e-01 0.012810
## [87,] 150 9.291e-01 0.012230
## [88,] 156 9.302e-01 0.011680
## [89,] 159 9.312e-01 0.011150
## [90,] 161 9.322e-01 0.010640
## [91,] 173 9.332e-01 0.010160
## [92,] 188 9.341e-01 0.009694
## [93,] 189 9.350e-01 0.009253
## [94,] 190 9.359e-01 0.008833
## [95,] 192 9.367e-01 0.008431
## [96,] 183 9.374e-01 0.008048
## [97,] 186 9.382e-01 0.007682
## [98,] 188 9.389e-01 0.007333
## [99,] 191 9.395e-01 0.007000
## [100,] 191 9.402e-01 0.006682
## Length Class Mode
## a0 100 -none- numeric
## beta 470200 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## classnames 2 -none- character
## call 5 -none- call
## nobs 1 -none- numeric
Implement Benjamini-Hochberg method for the original data, and use t.test() for computing p-values. Which features correspond to the rejected hypotheses? Interpret the result.
| feature_names | p_values | cv_scores |
|---|---|---|
| papers | 0.0000000 | 0.0000106 |
| submission | 0.0000000 | 0.0000213 |
| position | 0.0000000 | 0.0000319 |
| published | 0.0000002 | 0.0000425 |
| important | 0.0000003 | 0.0000532 |
| call | 0.0000004 | 0.0000638 |
| conference | 0.0000005 | 0.0000744 |
| candidates | 0.0000009 | 0.0000851 |
| dates | 0.0000014 | 0.0000957 |
| paper | 0.0000014 | 0.0001063 |
| topics | 0.0000051 | 0.0001170 |
| limited | 0.0000079 | 0.0001276 |
| candidate | 0.0000119 | 0.0001382 |
| camera | 0.0000210 | 0.0001489 |
| ready | 0.0000210 | 0.0001595 |
| authors | 0.0000215 | 0.0001701 |
| phd | 0.0000338 | 0.0001808 |
| projects | 0.0000350 | 0.0001914 |
| org | 0.0000374 | 0.0002020 |
| chairs | 0.0000586 | 0.0002127 |
| due | 0.0000649 | 0.0002233 |
| original | 0.0000649 | 0.0002339 |
| notification | 0.0000688 | 0.0002446 |
| salary | 0.0000797 | 0.0002552 |
| record | 0.0000909 | 0.0002658 |
| skills | 0.0000909 | 0.0002765 |
| held | 0.0001529 | 0.0002871 |
| team | 0.0001758 | 0.0002977 |
| pages | 0.0002007 | 0.0003084 |
| workshop | 0.0002007 | 0.0003190 |
| committee | 0.0002117 | 0.0003296 |
| proceedings | 0.0002117 | 0.0003403 |
| apply | 0.0002166 | 0.0003509 |
| strong | 0.0002246 | 0.0003615 |
| international | 0.0002296 | 0.0003722 |
| degree | 0.0003762 | 0.0003828 |
| excellent | 0.0003762 | 0.0003934 |
| post | 0.0003762 | 0.0004041 |
| presented | 0.0003765 | 0.0004147 |
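The manual Benjamini-Hochberg procedure can be cross-checked against R's built-in `p.adjust()`; a self-contained sketch on made-up p-values (not the e-mail data):

```r
# Benjamini-Hochberg by hand: reject the hypotheses with the k smallest
# p-values, where k is the largest index with p_(k) <= (k / m) * q.
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216)
m <- length(p)
q <- 0.05

o <- order(p)
k <- max(c(0, which(p[o] <= (seq_len(m) / m) * q)))
rejected_manual <- sort(o[seq_len(k)])

# The built-in adjustment should reject exactly the same hypotheses
rejected_padjust <- which(p.adjust(p, method = "BH") <= q)
identical(rejected_manual, rejected_padjust)  # TRUE
```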
knitr::opts_chunk$set(echo = TRUE)
library(readxl)
library(ggplot2)
library(knitr)
library(mgcv)
library(pamr)
library(kernlab)
library(glmnet)
set.seed(1234567890)
################################################################################
# 1) Using GAM and GLM to examine the mortality rates
################################################################################
set.seed(12345)
influanza = read_excel("./influenza.xlsx")
kable(head(influanza), caption = "influenza.xlsx")
################################################################################
# 1.1)
################################################################################
ggplot(influanza) +
  geom_line(aes(x = Time, y = Mortality, colour = "Mortality Time Series")) +
  labs(title = "Mortality", y = "Mortality", x = "Time", color = "Legend") +
  scale_color_manual(values = c("blue"))
ggplot(influanza) +
  geom_line(aes(x = Time, y = Influenza, colour = "Influenza Time Series")) +
  labs(title = "Influenza", y = "Influenza", x = "Time", color = "Legend") +
  scale_color_manual(values = c("orange"))
ggplot(influanza) +
  geom_line(aes(x = Time, y = Mortality, colour = "Mortality Time Series")) +
  geom_line(aes(x = Time, y = Influenza, colour = "Influenza Time Series")) +
  labs(title = "Superimposed Time Series", y = "Values",
       x = "Time", color = "Legend") +
  scale_color_manual(values = c("red", "brown"))
################################################################################
# 1.2)
################################################################################
# - Mortality is normally distributed and modelled as...
# - ...a linear function of Year...
# - ...and spline function of Week
# Model parameters are selected by generalized cross-validation (method = "GCV.Cp")
gam_model = gam(formula = Mortality ~ Year + s(Week), family = gaussian(), data = influanza, method="GCV.Cp")
print(gam_model)
summary(gam_model)
plot(gam_model)
time = influanza$Time
observed = influanza$Mortality
predicted = gam_model$fitted.values
mortality_values = data.frame(time, observed, predicted)
ggplot(mortality_values) +
geom_line(aes(x = time, y = observed,
colour = "Observed Mortality")) +
geom_line(aes(x = time, y = predicted,
colour = "Predicted Mortality")) +
labs(title = "Observed vs Predicted Mortality", y = "Mortality",
x = "Time", color = "Legend") +
scale_color_manual(values = c("blue", "orange"))
ggplot() +
geom_line(aes(x=influanza$Week,y=influanza$Mortality,
colour=as.factor(influanza$Year))) +
labs(x='Week', y='Mortality', colour='Year',
title='Mortality by Week and by Year')
penalties = seq(from = 0 , to = 20, by = 0.1)
index = 1
fitted_gam_with_penalties = data.frame()
for (penalty in penalties) {
  # Use a separate variable so the step-2 model (gam_model) is kept for
  # the residual analysis later on
  gam_model_pen = gam(formula = Mortality ~ Year + s(Week, sp = penalty),
                      family = gaussian(), data = influanza, method = "GCV.Cp")
  fitted_gam_with_penalties = rbind(fitted_gam_with_penalties,
                                    data.frame(penalty = penalty,
                                               deviance = gam_model_pen$deviance,
                                               df = sum(gam_model_pen$edf)))
}
# Deviance VS Penalty
ggplot(fitted_gam_with_penalties) +
  geom_line(aes(x = penalty, y = deviance, colour = "Deviance")) +
  labs(title = "Deviance VS Penalty", y = "Deviance",
       x = "Penalty", color = "Legend") +
  scale_color_manual(values = c("blue"))
## Create Models for High and Low penalties
gam_model_low_pen = gam(formula = Mortality ~ Year + s(Week, sp = 0.1),
family = gaussian(), data = influanza, method="GCV.Cp")
gam_model_high_pen = gam(formula = Mortality ~ Year + s(Week, sp = 20),
family = gaussian(), data = influanza, method="GCV.Cp")
high_low_df = data.frame(mortality = influanza$Mortality,
                         mortality_low = fitted(gam_model_low_pen),
                         mortality_high = fitted(gam_model_high_pen),
                         date = influanza$Time)
ggplot(high_low_df) +
geom_line(aes(x = date, y = mortality, colour = "Mortality")) +
geom_line(aes(x = date, y = mortality_low, colour = "Mortality (Low)")) +
geom_line(aes(x = date, y = mortality_high, colour = "Mortality (High)")) +
labs(title = "Mortalities vs Time", y = "Mortality",
x = "Time", color = "Legend") +
scale_color_manual(values = c("blue", "orange", "violet"))
time = influanza$Time
resid = gam_model$residuals
influenza_data = influanza$Influenza
print_data = data.frame(time, resid, influenza_data)
ggplot(print_data) +
geom_line(aes(x = time, y = resid,
colour = "Residuals")) +
geom_line(aes(x = time, y = influenza_data,
colour = "Influenza")) +
labs(title = "Residuals and Influenza vs Time", y = "Residuals and Mortality",
x = "Time", color = "Legend") +
scale_color_manual(values = c("blue", "orange"))
# Splines of Year, Week and Influenza, as stated in the task.
# Year has only a few distinct values, so k must not exceed that number.
gam_model_additive = gam(formula = Mortality ~ s(Year, k = length(unique(influanza$Year))) +
                           s(Week) + s(Influenza),
                         family = gaussian(), data = influanza, method = "GCV.Cp")
summary(gam_model_additive)
plot(gam_model_additive)
time = influanza$Time
observed = influanza$Mortality
predicted = gam_model_additive$fitted.values
mortality_values = data.frame(time, observed, predicted)
ggplot(mortality_values) +
geom_line(aes(x = time, y = observed,
colour = "Observed Mortality")) +
geom_line(aes(x = time, y = predicted,
colour = "Predicted Mortality")) +
labs(title = "Observed vs Predicted Mortality", y = "Mortality",
x = "Time", color = "Legend") +
scale_color_manual(values = c("blue", "orange"))
################################################################################
# 2) High-dimensional method
################################################################################
emails = read.csv2("data.csv", fileEncoding = "ISO-8859-1", sep = ";")
emails$Conference = as.factor(emails$Conference)
kable(head(emails[,1:10]))
n = dim(emails)[1]
id = sample(1:n, floor(n*0.70))
train_emails = emails[id,]
val_emails = emails[-id,]
################################################################################
# 2.1)
################################################################################
rownames(train_emails) = 1:nrow(train_emails)
x = t(train_emails[,-ncol(train_emails)])
y = train_emails[[ncol(train_emails)]]
mydata = list(x=x ,y=as.factor(y), geneid = as.character(1:nrow(x)),
genenames = rownames(x))
nsc_model = pamr.train(mydata, threshold=seq(0, 4, 0.1))
nsc_model_cv = pamr.cv(nsc_model, mydata)
pamr.plotcen(nsc_model, mydata, threshold = 1)
pamr.plotcen(nsc_model, mydata, threshold = 2.5)
print(nsc_model_cv)
pamr.plotcv(nsc_model_cv)
# Elastic Net
# Predictor Variables. The -1 removes the intercept component
x = model.matrix(Conference ~ ., train_emails)[,-1]
# Outcome variable
y = train_emails$Conference
model_elastic_net = glmnet(x, y, alpha = 0.5, family = "binomial")
# Support vector machine with "vanilladot" kernel.
# Vanilladot means "Linear Kernel Function"
model_svm = ksvm(Conference ~ ., train_emails, kernel = "vanilladot")
print(model_elastic_net)
summary(model_elastic_net)
plot(model_elastic_net)
# Original data set: emails
# H0: The feature has no effect on Conference
# Ha: The feature has an effect on Conference
# 1) Calculate p-values
# 2) Assign ranks and sort
# 3) BH Critical Value
# 4) Find critical value and select all p < score
q_value = 0.05
## 1)
feature_names = c()
p_values = c()
for (i in 1:(ncol(emails) - 1)) {
  colname = colnames(emails)[i]
  c_formula = paste(sep = "", colname, " ~ Conference")
  # alternative is one of "two.sided", "less", "greater"
  p_value = t.test(as.formula(c_formula), data = emails, alternative = "two.sided")
  feature_names = c(feature_names, colname)
  p_values = c(p_values, p_value$p.value)
}
## 2)
bh_entries = data.frame(feature_names, p_values)
bh_entries = bh_entries[order(bh_entries$p_values, decreasing = FALSE),]
rownames(bh_entries) = NULL
## 3)
getCV_BH = function(i, m, Q) {
  return(i / m * Q)
}
cv_scores = c()
for (j in 1:nrow(bh_entries)) {
  # Row names were reset above, so row j has rank j
  cv_scores = c(cv_scores, getCV_BH(j, nrow(bh_entries), q_value))
}
bh_entries = data.frame(bh_entries, cv_scores)
## 4)
## Find all rows where p_values < cv_scores
bh_entries = bh_entries[bh_entries$p_values < bh_entries$cv_scores,]
kable(bh_entries)